mlx_whisper: add support for audio input from stdin #1012

anthonywu · 2024-10-03T10:42:28Z

Problem

I wanted to pipe an audio file to mlx_whisper, but found that it only accepted file paths. This PR lets mlx_whisper accept audio on stdin and pass it through to ffmpeg, after which the rest of the workflow proceeds as usual.

Changes

The load_audio helper adjusts its ffmpeg flags depending on whether the input is a file path or stdin.
The CLI parser omits the otherwise-required positional audio argument when stdin is detected as active.
An optional --input-name argument lets users name the otherwise anonymous stdin content, since no name can be derived from a file path.
Added tests in a zsh script (the macOS default shell) that drive and check the changes from the CLI.
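
The file-path vs. stdin switch described above can be sketched as follows. This is a minimal illustration, not the PR's actual load_audio code: the function name build_ffmpeg_cmd, the from_stdin parameter, and the 16 kHz default are assumptions made for the example. It only builds the ffmpeg argument list, so it runs without ffmpeg installed.

```python
from typing import List

SAMPLE_RATE = 16000  # assumed default sample rate for this sketch


def build_ffmpeg_cmd(audio_path: str, from_stdin: bool = False, sr: int = SAMPLE_RATE) -> List[str]:
    """Return an ffmpeg argv that decodes the input to raw 16-bit mono PCM on stdout.

    Hypothetical helper: the name and signature are not taken from the PR.
    """
    if from_stdin:
        # "pipe:0" is ffmpeg's name for standard input. This sketch drops
        # -nostdin in this mode because stdin carries the audio data.
        input_args = ["-threads", "0", "-i", "pipe:0"]
    else:
        # -nostdin disables ffmpeg's interactive handling of standard input.
        input_args = ["-nostdin", "-threads", "0", "-i", audio_path]
    # The trailing "-" sends the decoded PCM stream to ffmpeg's stdout.
    output_args = ["-f", "s16le", "-ac", "1", "-acodec", "pcm_s16le", "-ar", str(sr), "-"]
    return ["ffmpeg", *input_args, *output_args]
```

A caller would then hand the piped bytes to the subprocess, for example with subprocess.run(cmd, input=audio_bytes, capture_output=True), and decode the captured stdout as 16-bit samples.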

Process

Ran black and pre-commit on the changes before opening the PR.
python test.py reports 4 errors, some involving floating-point comparisons. They look unrelated to my change and may be known issues.

awni · 2024-10-03T14:06:35Z

Thanks for the addition. What do you think about a couple modifications:

For piping from stdin use - as in mlx_whisper - . That is what we do in MLX LM so it is more consistent.
The argument --input-name is confusing to me. I understand it now but I think it will in general be confusing. It might be more clear to allow an optional --output-names argument with appropriate defaults (basename when available or output when not).

anthonywu · 2024-10-04T01:07:09Z

For piping from stdin use - as in mlx_whisper - . That is what we do in MLX LM so it is more consistent.

Done. I agree that consistency between related projects is worth more than aesthetic preferences. This also has the nice effect of eliminating test cases.

The only tradeoff is that users who reflexively expect to pipe anything into a tool's bare name will have to read the docs.
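
As a rough sketch of the "-" convention discussed here (not the code from this PR), argparse already treats a lone "-" as a positional value, so the existing audio argument can carry it. The helper names build_parser and is_stdin are hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a reduced parser with only the positional audio argument."""
    parser = argparse.ArgumentParser(prog="mlx_whisper")
    # A lone "-" is parsed by argparse as a positional value, not as an option.
    parser.add_argument(
        "audio", nargs="+", help="Audio file(s) to transcribe, or '-' to read from stdin"
    )
    return parser


def is_stdin(audio_arg: str) -> bool:
    """Return True when the argument asks for audio to be read from stdin."""
    return audio_arg == "-"
```

With this shape, an invocation such as `cat talk.mp3 | mlx_whisper -` yields args.audio == ["-"], and the caller can branch on is_stdin before deciding how to load the audio.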

The argument --input-name is confusing to me. I understand it now but I think it will in general be confusing. It might be more clear to allow an optional --output-names argument with appropriate defaults (basename when available or output when not).

I've come around to --output-name and have proposed a "template" solution that preserves existing behavior. It also leaves room for future improvements, such as fancy rename strategies based on the transcribed audio content, and lets power users produce different output names when they run the same audio_path with different parameters.

anthonywu · 2024-10-04T01:47:02Z

whisper/mlx_whisper/cli.py

+
+ parser.add_argument("audio", nargs="+", help="Audio file(s) to transcribe")
+


My black==24.4.2 insists on reformatting this line, which I didn't change.

anthonywu · 2024-10-04T01:48:04Z

whisper/mlx_whisper/cli.py

+ parser.add_argument(
+ "--output-name",
+ type=str,
+ default="{basename}",


This defaults to the pre-existing behavior. I would consider the default an implementation detail that the user need not be concerned about. If this is too much, we can default to None and handle None inside the implementation.

anthonywu · 2024-10-04T01:49:55Z

whisper/mlx_whisper/writers.py

+ if isinstance(audio_obj, (str, pathlib.Path)):
+ basename = pathlib.Path(audio_obj).stem
+ else:
+ # mx.array, np.ndarray, etc
+ basename = "content"
+
+ output_basename = self.output_name_template.format(basename=basename)
+
+ output_path = (pathlib.Path(self.output_dir) / output_basename).with_suffix(
+ f".{self.extension}"
 )

- with open(output_path, "w", encoding="utf-8") as f:
+ with output_path.open("wt", encoding="utf-8") as f:


Refactored to use the more "modern" pathlib API; see https://docs.astral.sh/ruff/rules/builtin-open/
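
The naming logic in the diff above can be exercised in isolation. The sketch below restates it as a free function; resolve_output_path and its parameter names are hypothetical, while the "content" fallback and the {basename} placeholder mirror the diff.

```python
import pathlib
from typing import Any


def resolve_output_path(
    audio_obj: Any, output_dir: str, name_template: str, extension: str
) -> pathlib.Path:
    """Compute the output file path for one transcription result."""
    if isinstance(audio_obj, (str, pathlib.Path)):
        # File input: reuse the file name without its extension.
        basename = pathlib.Path(audio_obj).stem
    else:
        # Non-path input (arrays, stdin bytes): no name to derive.
        basename = "content"
    output_basename = name_template.format(basename=basename)
    return (pathlib.Path(output_dir) / output_basename).with_suffix(f".{extension}")
```

One caveat of this approach: with_suffix replaces everything after the last dot in the formatted name, so a template that produces a name such as talk_v1.5 ends up as talk_v1.srt rather than talk_v1.5.srt.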

anthonywu · 2024-10-04T01:51:00Z

whisper/test_cli.sh

+ --output-name "{basename}_mwpl_${test_val}" \
+ --output-dir "$TEST_OUTPUT_DIR" \
+ --output-format srt \
+ --max-words-per-line $test_val \


I think this can be useful for research: run several variations of a transcription by adjusting knobs, then name the outputs accordingly.
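
A parameter sweep of this kind could also be driven from Python. The sketch below only builds the argument lists and uses just the flags shown in the shell snippet above; build_sweep_commands is a hypothetical helper, and any other flags the real CLI may require are left out.

```python
from typing import Iterable, List


def build_sweep_commands(
    audio_path: str, values: Iterable[int], output_dir: str
) -> List[List[str]]:
    """Build one mlx_whisper argv per --max-words-per-line value."""
    commands = []
    for value in values:
        commands.append([
            "mlx_whisper", audio_path,
            # Doubled braces keep the literal {basename} placeholder for the CLI.
            "--output-name", f"{{basename}}_mwpl_{value}",
            "--output-dir", output_dir,
            "--output-format", "srt",
            "--max-words-per-line", str(value),
        ])
    return commands
```

Each list can be passed to subprocess.run(cmd, check=True), so every run writes a distinctly named file into the same output directory.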

anthonywu · 2024-10-04T01:51:52Z

whisper/test_cli.sh

@@ -0,0 +1,69 @@
+#!/bin/zsh -e


This is a bare-bones shell test harness that adds coverage for now, without involving this PR in choosing a higher-level shell test runner.

add support for audio and input name from stdin

bb5d7db

refactored to stdin - arg, and output-name template

b6435dc

anthonywu added 2 commits October 3, 2024 15:45

fix bugs, add test coverage

7b818c0

fix doc to match arg rename

266f99a

anthonywu commented Oct 10, 2024
